Skip to content

SOLR-18220 Add support for countDist in rollup for streaming expressions#4394

Open
KhushJain wants to merge 3 commits intoapache:mainfrom
KhushJain:SOLR-18220
Open

SOLR-18220 Add support for countDist in rollup for streaming expressions#4394
KhushJain wants to merge 3 commits intoapache:mainfrom
KhushJain:SOLR-18220

Conversation

@KhushJain
Copy link
Copy Markdown
Contributor

@KhushJain KhushJain commented May 3, 2026

https://issues.apache.org/jira/browse/SOLR-18220

Description

Rollup function to support countDist (count distinct) statistics in /stream handler.

Solution

Implement the existing CountDistinctMetric stub in the streaming expressions framework:

CountDistinctMetric.java:

  1. Was a no-op stub. update() did nothing and getValue() returned null. Now uses HashSet<Object> to track distinct non-null values per group.
  2. Fixed toExpression() which emitted a spurious outputLong parameter producing malformed expressions like countDist(a_i,true)
  3. Added parameter count validation in theStreamExpression constructor.

Tests

Updated existing tests in StreamDecoratorTest.java and StreamingTest.java:

  • StreamDecoratorTest.testRollupStream: Added countDist(a_i) and countDist(a_s) to the expression and asserted.
  • StreamDecoratorTest.testHashRollupStream: Same additions for hash-based rollup.
  • StreamingTest.testRollupStream: Added CountDistinctMetric("a_i") and CountDistinctMetric("a_s") to the metrics array and asserted, including the null grouping field test.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide
  • I have added a changelog entry for my change

@github-actions github-actions Bot added documentation Improvements or additions to documentation client:solrj tests labels May 3, 2026
@epugh epugh requested a review from Copilot May 4, 2026 12:11
@epugh
Copy link
Copy Markdown
Contributor

epugh commented May 4, 2026

So was this a partly started feature that was never finished??

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds countDist support to streaming-expression rollups by turning CountDistinctMetric from a stub into a working metric, updating rollup docs, and extending rollup tests.

Changes:

  • Implement CountDistinctMetric value tracking and simplify its emitted stream expression.
  • Extend rollup and hash-rollup tests to assert distinct counts for integer/string fields.
  • Document countDist(col) as a supported rollup metric and add an unreleased changelog entry.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
solr/solrj-streaming/src/java/org/apache/solr/client/solrj/io/stream/metrics/CountDistinctMetric.java Implements client-side distinct counting and updates expression parsing/serialization.
solr/solrj-streaming/src/test/org/apache/solr/client/solrj/io/stream/StreamingTest.java Adds programmatic rollup assertions for CountDistinctMetric.
solr/solrj-streaming/src/test/org/apache/solr/client/solrj/io/stream/StreamDecoratorTest.java Adds expression-based rollup/hashRollup assertions for countDist and updates fixture data.
solr/solr-ref-guide/modules/query-guide/pages/stream-decorator-reference.adoc Documents countDist(col) in rollup supported metrics and syntax example.
changelog/unreleased/SOLR-18220-support-countdist-in-rollup.yml Adds the unreleased changelog entry for the feature.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 74 to +78
public void update(Tuple tuple) {
// Nop for now
Object value = tuple.get(columnName);
if (value != null) {
distinctValues.add(value);
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APPROX_COUNT_DISTINCT / hll constant existed but the code was a complete no-op. I'd argue that's a separate feature and needs its own function registered like hll()

Comment on lines +58 to +60
if (1 != expression.getParameters().size()) {
throw new IOException(
String.format(Locale.ROOT, "Invalid expression %s - unknown operands found", expression));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old constructor silently ignored the second parameter anyway, so no one could have depended on it. This PR fixes the serialization to match what the constructor actually accepts

Comment on lines 97 to +98
public StreamExpressionParameter toExpression(StreamFactory factory) throws IOException {
return new StreamExpression(getFunctionName())
.withParameter(columnName)
.withParameter(Boolean.toString(outputLong));
return new StreamExpression(getFunctionName()).withParameter(columnName);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the SQL module's map-reduce path, not the streaming expression path. This PR targets countDist in streaming expression rollup()

@KhushJain
Copy link
Copy Markdown
Contributor Author

So was this a partly started feature that was never finished??

The class existed and worked for facets and stats where it was just used as an identifier that gets push down to json facet api to do the actual computation. So update()/getValue() was never called in that workflow.
For rollup, it needs implementations for these methods to work. Currently countDist was just returning null. So even though the method is registered, it would have just returned null when invoked. This PR fills in that implementation.

@KhushJain
Copy link
Copy Markdown
Contributor Author

@epugh The PR check failed on 2 existing flaky tests on CloudConsistency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

client:solrj documentation Improvements or additions to documentation tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants